{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Spam Filtering Using The [Enron Dataset][1]\n",
    "[1]: http://www.aueb.gr/users/ion/data/enron-spam/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from pymldb import Connection\n",
    "mldb = Connection('http://localhost/')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's start by loading the dataset. We have already merged the different email files in a sensible manner into a .csv file, which we've made available online. Since this dataset is actually made up of six different datasets, we'll restrict ourself to the first one for simplicity, using a \"where\" clause."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<Response [201]>\n"
     ]
    }
   ],
   "source": [
    "print mldb.post('/v1/procedures', {\n",
    "    'type': 'import.text',\n",
    "    'params': {\n",
    "        'dataFileUrl': 'http://public.mldb.ai/datasets/enron.csv.gz',\n",
    "        'outputDataset': 'enron_data',\n",
    "        'named': \"'enron_' + dataset + '_mail_' + index\",\n",
    "        'where': 'dataset = 1',\n",
    "        'runOnCreation': True\n",
    "        }\n",
    "    })"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is what the dataset looks like.\n",
    "\n",
    "*index*: order in which the emails arrived in the user's inbox  \n",
    "*msg*: actual content of the email  \n",
    "*label*: was the email legitimate (*ham*) or not (*spam*)  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>index</th>\n",
       "      <th>label</th>\n",
       "      <th>msg</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>_rowName</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>enron_1_mail_0</th>\n",
       "      <td>0</td>\n",
       "      <td>spam</td>\n",
       "      <td>Subject: dobmeos with hgh my energy level has ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>enron_1_mail_1</th>\n",
       "      <td>1</td>\n",
       "      <td>spam</td>\n",
       "      <td>Subject: your prescription is ready . . oxwq s...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>enron_1_mail_2</th>\n",
       "      <td>2</td>\n",
       "      <td>ham</td>\n",
       "      <td>Subject: christmas tree farm pictures</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>enron_1_mail_3</th>\n",
       "      <td>3</td>\n",
       "      <td>ham</td>\n",
       "      <td>Subject: vastar resources , inc .gary , produc...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>enron_1_mail_4</th>\n",
       "      <td>4</td>\n",
       "      <td>ham</td>\n",
       "      <td>Subject: calpine daily gas nomination- calpine...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>enron_1_mail_5</th>\n",
       "      <td>5</td>\n",
       "      <td>ham</td>\n",
       "      <td>Subject: re : issuefyi - see note below - alre...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>enron_1_mail_6</th>\n",
       "      <td>6</td>\n",
       "      <td>ham</td>\n",
       "      <td>Subject: meter 7268 nov allocationfyi .- - - -...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>enron_1_mail_7</th>\n",
       "      <td>7</td>\n",
       "      <td>spam</td>\n",
       "      <td>Subject: get that new car 8434people nowthe we...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>enron_1_mail_8</th>\n",
       "      <td>8</td>\n",
       "      <td>ham</td>\n",
       "      <td>Subject: mcmullen gas for 11 / 99jackie ,since...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>enron_1_mail_9</th>\n",
       "      <td>9</td>\n",
       "      <td>spam</td>\n",
       "      <td>Subject: await your responsedear partner ,we a...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                index label                                                msg\n",
       "_rowName                                                                      \n",
       "enron_1_mail_0      0  spam  Subject: dobmeos with hgh my energy level has ...\n",
       "enron_1_mail_1      1  spam  Subject: your prescription is ready . . oxwq s...\n",
       "enron_1_mail_2      2   ham              Subject: christmas tree farm pictures\n",
       "enron_1_mail_3      3   ham  Subject: vastar resources , inc .gary , produc...\n",
       "enron_1_mail_4      4   ham  Subject: calpine daily gas nomination- calpine...\n",
       "enron_1_mail_5      5   ham  Subject: re : issuefyi - see note below - alre...\n",
       "enron_1_mail_6      6   ham  Subject: meter 7268 nov allocationfyi .- - - -...\n",
       "enron_1_mail_7      7  spam  Subject: get that new car 8434people nowthe we...\n",
       "enron_1_mail_8      8   ham  Subject: mcmullen gas for 11 / 99jackie ,since...\n",
       "enron_1_mail_9      9  spam  Subject: await your responsedear partner ,we a..."
      ]
     },
     "execution_count": 89,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mldb.query('select index, msg, label from enron_data order by index limit 10')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's create a *sql.expression* that will simply tokenize the emails into a bag of words. Those will be our features on which we will train a classifier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<Response [201]>\n"
     ]
    }
   ],
   "source": [
    "print mldb.put('/v1/functions/bow', {\n",
    "    'type': 'sql.expression',\n",
    "    'params': {\n",
    "        'expression': \"\"\"\n",
    "            tokenize(msg, {splitChars: ' :.-!?''\"()[],', quoteChar: ''}) as bow\n",
    "            \"\"\"\n",
    "    }\n",
    "})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then we can generate the features for the whole dataset, and write them into a new dataset, using the *transform* procedure."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<Response [201]>\n"
     ]
    }
   ],
   "source": [
    "print mldb.post('/v1/procedures', {\n",
    "    'type': 'transform',\n",
    "    'params': {\n",
    "        'inputData': \"\"\"\n",
    "            select bow({msg})[bow] as *, label = 'spam' as message_is_spam\n",
    "            from enron_data\n",
    "            \"\"\",\n",
    "        'outputDataset': 'enron_features',\n",
    "        'runOnCreation': True\n",
    "    }\n",
    "})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is a snapshot of the sparse feature matrix:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 98,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>/</th>\n",
       "      <th>24</th>\n",
       "      <th>Subject</th>\n",
       "      <th>agent</th>\n",
       "      <th>apache</th>\n",
       "      <th>can</th>\n",
       "      <th>coming</th>\n",
       "      <th>cornhusker</th>\n",
       "      <th>cynergy</th>\n",
       "      <th>daren</th>\n",
       "      <th>...</th>\n",
       "      <th>tadalafil</th>\n",
       "      <th>talkingabout</th>\n",
       "      <th>them</th>\n",
       "      <th>three</th>\n",
       "      <th>tongue</th>\n",
       "      <th>treatment</th>\n",
       "      <th>two</th>\n",
       "      <th>under</th>\n",
       "      <th>ve</th>\n",
       "      <th>viagra</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>_rowName</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>enron_1_mail_3056</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>enron_1_mail_2982</th>\n",
       "      <td>10</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>enron_1_mail_1331</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>enron_1_mail_2763</th>\n",
       "      <td>6</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>enron_1_mail_388</th>\n",
       "      <td>9</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>enron_1_mail_1775</th>\n",
       "      <td>2</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>enron_1_mail_2745</th>\n",
       "      <td>2</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>enron_1_mail_1100</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>enron_1_mail_2738</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>enron_1_mail_3067</th>\n",
       "      <td>6</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>10 rows × 391 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                    /  24  Subject  agent  apache  can  coming  cornhusker  \\\n",
       "_rowName                                                                     \n",
       "enron_1_mail_3056   1   1        1      1       3    1       1           1   \n",
       "enron_1_mail_2982  10 NaN        1    NaN     NaN  NaN     NaN         NaN   \n",
       "enron_1_mail_1331 NaN NaN        1    NaN     NaN  NaN     NaN         NaN   \n",
       "enron_1_mail_2763   6 NaN        1    NaN     NaN  NaN     NaN         NaN   \n",
       "enron_1_mail_388    9 NaN        1    NaN     NaN    1     NaN         NaN   \n",
       "enron_1_mail_1775   2 NaN        1    NaN     NaN  NaN     NaN         NaN   \n",
       "enron_1_mail_2745   2 NaN        1    NaN     NaN    1     NaN         NaN   \n",
       "enron_1_mail_1100 NaN NaN        1    NaN     NaN  NaN     NaN         NaN   \n",
       "enron_1_mail_2738 NaN NaN        1    NaN     NaN    2     NaN         NaN   \n",
       "enron_1_mail_3067   6 NaN        1    NaN     NaN    2     NaN         NaN   \n",
       "\n",
       "                   cynergy  daren   ...    tadalafil  talkingabout  them  \\\n",
       "_rowName                            ...                                    \n",
       "enron_1_mail_3056        2      1   ...          NaN           NaN   NaN   \n",
       "enron_1_mail_2982      NaN    NaN   ...          NaN           NaN   NaN   \n",
       "enron_1_mail_1331      NaN    NaN   ...          NaN           NaN   NaN   \n",
       "enron_1_mail_2763      NaN    NaN   ...          NaN           NaN   NaN   \n",
       "enron_1_mail_388       NaN    NaN   ...          NaN           NaN   NaN   \n",
       "enron_1_mail_1775      NaN    NaN   ...          NaN           NaN   NaN   \n",
       "enron_1_mail_2745      NaN    NaN   ...          NaN           NaN   NaN   \n",
       "enron_1_mail_1100      NaN    NaN   ...          NaN           NaN   NaN   \n",
       "enron_1_mail_2738      NaN    NaN   ...          NaN           NaN   NaN   \n",
       "enron_1_mail_3067      NaN    NaN   ...            1             1     1   \n",
       "\n",
       "                   three  tongue  treatment  two  under  ve  viagra  \n",
       "_rowName                                                             \n",
       "enron_1_mail_3056    NaN     NaN        NaN  NaN    NaN NaN     NaN  \n",
       "enron_1_mail_2982    NaN     NaN        NaN  NaN    NaN NaN     NaN  \n",
       "enron_1_mail_1331    NaN     NaN        NaN  NaN    NaN NaN     NaN  \n",
       "enron_1_mail_2763    NaN     NaN        NaN  NaN    NaN NaN     NaN  \n",
       "enron_1_mail_388     NaN     NaN        NaN  NaN    NaN NaN     NaN  \n",
       "enron_1_mail_1775    NaN     NaN        NaN  NaN    NaN NaN     NaN  \n",
       "enron_1_mail_2745    NaN     NaN        NaN  NaN    NaN NaN     NaN  \n",
       "enron_1_mail_1100    NaN     NaN        NaN  NaN    NaN NaN     NaN  \n",
       "enron_1_mail_2738    NaN     NaN        NaN  NaN    NaN NaN     NaN  \n",
       "enron_1_mail_3067      1       1          1    1      1   1       1  \n",
       "\n",
       "[10 rows x 391 columns]"
      ]
     },
     "execution_count": 98,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mldb.query('select * from enron_features limit 10')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, let's train a very simple classifier, by training on half of the messages, and testing on the other half. This classifier will give a score to every email, and we can then choose a threshold where everything above the threshold is classified as spam, and every thing below as ham."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 99,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "AUC = 0.992623778886\n"
     ]
    }
   ],
   "source": [
    "res = mldb.post('/v1/procedures', {\n",
    "    'type': 'classifier.experiment',\n",
    "    'params': {\n",
    "        'experimentName': 'enron_experiment1',\n",
    "        'inputData': '''\n",
    "            select \n",
    "                {* excluding(message_is_spam)} as features, \n",
    "                message_is_spam as label \n",
    "            from enron_features''',\n",
    "        'modelFileUrlPattern': 'file://enron_model_$runid.cls',\n",
    "        'algorithm': 'bbdt',\n",
    "        'runOnCreation': True\n",
    "    }\n",
    "})\n",
    "\n",
    "print 'AUC =', res.json()['status']['firstRun']['status']['folds'][0]['resultsTest']['auc']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is an impressive-looking AUC!\n",
    "\n",
    "But [the AUC score of a classifier is only a very generic measure of performance][1]. When having a specific problem like spam filtering, we're better off using a performance metric that truly matches our intuition about what a good spam filter ought to be. Namely, a good spam filtering algorithm should almost never flag as spam a legitime email, while keeping your inbox as spam-free as possible. This is what should be used to choose the threshold for the classifier, and then to measure its performance.\n",
    "\n",
    "So instead of the AUC (that doesn't pick a specific threshold but uses all of them), let's use as our performance metric the best [$F_{0.05}$ score][2], which gives 20 times more importance to precision than recall. In other words, this metric represents the fact that classifying as spam **only** what is really spam is 20 times more important than finding all the spam.\n",
    "\n",
    "Let's see how we are doing with that metric.\n",
    "[1]: http://mldb.ai/blog/posts/2016/01/ml-meets-economics/\n",
    "[2]: https://en.wikipedia.org/wiki/F1_score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<Response [201]>\n"
     ]
    }
   ],
   "source": [
    "print mldb.put('/v1/functions/enron_score', {\n",
    "    'type': 'sql.expression',\n",
    "    'params': {\n",
    "        'expression': \"\"\"\n",
    "            (1 + pow(ratio, 2)) * (precision * recall) / (precision * pow(ratio, 2) + recall) as enron_score\n",
    "            \"\"\"\n",
    "    }\n",
    "})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 100,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>enron_score</th>\n",
       "      <th>ham_in_inbox</th>\n",
       "      <th>ham_in_junkmail</th>\n",
       "      <th>spam_in_inbox</th>\n",
       "      <th>spam_in_junkmail</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>_rowName</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>best_score</th>\n",
       "      <td>0.994153</td>\n",
       "      <td>1880</td>\n",
       "      <td>1</td>\n",
       "      <td>416</td>\n",
       "      <td>346</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            enron_score  ham_in_inbox  ham_in_junkmail  spam_in_inbox  \\\n",
       "_rowName                                                                \n",
       "best_score     0.994153          1880                1            416   \n",
       "\n",
       "            spam_in_junkmail  \n",
       "_rowName                      \n",
       "best_score               346  "
      ]
     },
     "execution_count": 100,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mldb.query(\"\"\"\n",
    "    select \n",
    "        \"falseNegatives\" as spam_in_inbox, \n",
    "        \"trueNegatives\"  as ham_in_inbox,\n",
    "        \"falsePositives\" as ham_in_junkmail, \n",
    "        \"truePositives\"  as spam_in_junkmail, \n",
    "        enron_score({precision, recall, ratio:0.05}) as *\n",
    "    named 'best_score'\n",
    "    from enron_experiment1_results_0\n",
    "    order by enron_score desc\n",
    "    limit 1\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, in order to maximize our score (i.e. to get very few ham messages in the junkmail folder) we have to accept a very high proportion of spam in our inbox, meaning that even though we have a very impressive-looking AUC score, our spam filter isn't actually very good!\n",
    "\n",
    "You can read more about the dangers of relying too much on AUC and the benefits of using a problem-specific measure in our [Machine Learning Meets Economics](http://blog.mldb.ai/blog/posts/2016/01/ml-meets-economics/) series of blog posts."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Where to next?\n",
    "\n",
    "Check out the other [Tutorials and Demos](../../../../doc/#builtin/Demos.md.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
